Skills: Sampling and confidence intervals
Week 2
In general, the data sets you work with represent samples of larger populations. Your goal is to be able to generalize from your sample to the population.
Sometimes this is literally and obviously true. For example, you might have sent a survey to one percent of all Canadian households, and you want to analyze the data to make some claims about the characteristics of Canadian households in general.
Other times, describing your sample in terms of the population it comes from is not as obvious. For example, you might have data about every county in the United States suggesting that incomes are higher in urban counties than in rural counties. You might argue that this is not a sample, because you have data on every county that exists. However, if you want to claim that this difference is based on a fundamental difference that is likely to always exist between urban and rural counties in the United States, then you need to be able to generalize this difference to a broader population, which would be all hypothetical counties that could ever exist in the United States.
This week, you’ll continue to work with the data you generated last week. This dataset represents the full population you are studying.
You will be doing the following:
- Draw samples from the full population
- Calculate descriptive statistics of your sample
- Use the descriptive statistics to make guesses about what the values would be for the full population.
- Check your guesses against the values you calculated last week.
Data
We’ll start with a dataset of 10,000 observations like the one you generated for your assignment last week.
R
Here, I’ll load a CSV file with the dataset into my R environment.
library(here)
library(tidyverse)
library(knitr)
full_data <- here("week1",
                  "full-data.csv") |>
  read_csv()
Here are the first few rows:
| sq_feet | dt_dist | color | pool | blue | green | rent |
|---|---|---|---|---|---|---|
| 1195.304 | 3.505617 | Red | 1 | 0 | 0 | 961.4318 |
| 1502.807 | 4.577164 | Red | 0 | 0 | 0 | 923.1070 |
| 1839.383 | 2.606373 | Red | 0 | 0 | 0 | 1257.2496 |
| 1405.328 | 3.343034 | Blue | 0 | 1 | 0 | 1016.5779 |
| 1390.682 | 4.262791 | Red | 0 | 0 | 0 | 860.3381 |
| 1318.790 | 4.128393 | Red | 0 | 0 | 0 | 816.7330 |
Excel
I’ll use a spreadsheet with a completed example of last week’s assignment as a starting point.
You can find the spreadsheet here (click the download button).
Here is the sheet with the dataset.
Adding random error to an outcome
Remember that we calculated the outcome variable directly from the predictor variables, so the outcome is perfectly predicted by those predictors. To make things a little more realistic, let’s add some random noise to the outcome.
R
In R, you can use the rnorm() function to add random
variation to a variable. Here I’ll add a random value to the variable
representing monthly rent. This extra value has an average of zero (so
negative numbers will be as common as positive numbers) and a standard
deviation of 100 (most of the values in a normal distribution fall within
about two standard deviations of the mean, so this is like making the
outcome equal to the calculated value, plus or minus 200).
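In code, that step might look like this (a sketch; `full_data` is the dataset loaded above, and `set.seed()` is added here only to make the noise reproducible):

```r
# Add normally-distributed noise with mean 0 and standard deviation 100
# to the rent column; n() gives one random draw per row.
set.seed(123)  # optional: makes the random noise reproducible
full_data <- full_data |>
  mutate(rent = rent + rnorm(n(), mean = 0, sd = 100))
```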
Excel
In Excel, you can add some noise to a variable by generating a new random variable that varies between 0 and 1:
And then adding a term to the outcome formula that will be a normally-distributed variable with a mean of zero and a standard deviation of 200.
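The mechanics of that uniform-to-normal conversion can be sketched in R (illustrative only; `qnorm()` plays the role of Excel’s NORM.INV()):

```r
# Inverse-CDF method: a uniform draw on (0, 1) passed through the
# normal quantile function becomes a normal draw with the chosen
# mean and standard deviation.
u <- runif(10)                         # like Excel's RAND()
noise <- qnorm(u, mean = 0, sd = 200)  # like NORM.INV(RAND(), 0, 200)
```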
Drawing a sample
Let’s say this full dataset represents the full population I’m interested in, but I don’t have access to the full dataset. I only have access to a sample of 100 observations.
Here is one example of a sample I might have access to.
R
In R, I can draw a random sample of 100 observations from a larger
dataframe using the function sample_n().
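For example (a sketch; `sample_1` is the name used for the sample in the later code):

```r
# Draw a simple random sample of 100 rows from the full population.
set.seed(456)  # optional: makes the sample reproducible
sample_1 <- full_data |>
  sample_n(100)
```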
Excel
In Excel, I can take the following steps to take a random sample from a larger data set.
First, I add an index variable with row numbers.
Then, I create an index column for my sample data set. If I want to sample 100 observations, I generate 100 numbers from a random uniform distribution ranging between 1 and 10,000 (the size of the population I’m drawing from). I copy and paste the values of the variable (so they won’t keep recalculating new random values), and check to make sure there are no duplicate values (if there were, I’d need to generate a few more values until I had 100 unique ones).
Now, I can use the VLOOKUP() function to get the
values for the rows indicated by my random indices.
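For comparison, the same index-based approach can be written in base R (a sketch using `full_data` from earlier; `sample()` draws without replacement, so the duplicate check Excel needs is unnecessary):

```r
# Draw 100 unique row indices from 1..10,000, then keep those rows;
# this mirrors the index-plus-VLOOKUP() workflow.
idx <- sample(nrow(full_data), size = 100)
sample_1 <- full_data[idx, ]
```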
Calculating the sample average
We can take the average value of each variable in our sample, with the hope that this will tell us something about the average in the population it came from.
R
Here’s how you would generate a neat little table with the average value for each variable in your sample.
means <- sample_1 |>
pivot_longer(cols = where(is.numeric),
names_to = "Variable",
values_to = "Value") |>
group_by(Variable) |>
summarise(Average = mean(Value))
kable(means)
| Variable | Average |
|---|---|
| blue | 0.400000 |
| dt_dist | 4.304352 |
| green | 0.110000 |
| pool | 0.140000 |
| rent | 995.198130 |
| sq_feet | 1494.528730 |
Excel
And here’s how you would calculate the averages from the sample in Excel.
Guessing the population average (continuous variables)
You can use a one-sample t-test to make a reasonable guess of what the average value is for the population a sample comes from. A t-test will give you a range of values that the average of the full population might be within, at a given level of confidence.
R
Here is how you would extract the lower and upper values of a 95-percent confidence interval for average values in R.
means <- sample_1 |>
select(dt_dist, rent, sq_feet) |>
pivot_longer(cols = where(is.numeric),
names_to = "Variable",
values_to = "Value") |>
group_by(Variable) |>
summarise(Average = mean(Value),
`C.I. Low` = t.test(Value, conf.level = 0.95)$conf.int[1],
`C.I. High` = t.test(Value, conf.level = 0.95)$conf.int[2])
kable(means)
| Variable | Average | C.I. Low | C.I. High |
|---|---|---|---|
| dt_dist | 4.304352 | 4.165545 | 4.443158 |
| rent | 995.198130 | 962.540279 | 1027.855981 |
| sq_feet | 1494.528730 | 1467.646248 | 1521.411211 |
Excel
To calculate the 95-percent confidence interval in Excel, you’d need to know that the confidence margin is the standard error of the mean times the z-score corresponding to the confidence level you want.
So, first you calculate the standard deviation of the sample and divide it by the square root of the number of observations to get the standard error.
Then you calculate the z-score for the desired confidence level using
the function NORM.S.INV(). The probability you should use
for a 95% confidence level is 0.975. Or, if you use a probability of
0.025, you’ll get the opposite (i.e. negative) of the z-score you
need.
Now you can calculate the upper and lower bounds of the confidence interval.
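Those Excel steps can be checked against a quick R sketch (illustrative values; in the spreadsheet the mean and standard deviation come from your sample):

```r
# Manual 95% confidence interval for a mean:
# margin = z * (sample sd / sqrt(n))
x <- c(4.1, 4.5, 3.9, 4.4, 4.2, 4.0, 4.6, 4.3)  # a hypothetical sample
n <- length(x)
z <- qnorm(0.975)  # what Excel's NORM.S.INV(0.975) returns, about 1.96
margin <- z * sd(x) / sqrt(n)
c(lower = mean(x) - margin, upper = mean(x) + margin)
```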
Guessing the population proportions (categorical variables)
Confidence intervals for proportions work the same way as confidence intervals for averages, but you use the standard error of the proportion instead of the standard deviation.
R
In R, you use prop.test rather than t.test when you want to find the confidence interval for a proportion.
means <- sample_1 |>
select(pool, blue, green) |>
pivot_longer(cols = where(is.numeric),
names_to = "Variable",
values_to = "Value") |>
group_by(Variable) |>
summarise(Average = mean(Value),
`C.I. Low` = prop.test(x = sum(Value==1),
n = n(),
conf.level = 0.95)$conf.int[1],
`C.I. High` = prop.test(x = sum(Value==1),
n = n(),
conf.level = 0.95)$conf.int[2])
kable(means)
| Variable | Average | C.I. Low | C.I. High |
|---|---|---|---|
| blue | 0.40 | 0.3047801 | 0.5029964 |
| green | 0.11 | 0.0588672 | 0.1922335 |
| pool | 0.14 | 0.0814039 | 0.2271163 |
Excel
To calculate the 95-percent confidence interval in Excel, you’d need to know that the confidence margin is the standard error times the z-score corresponding to the confidence level you want.
The standard error is given by the equation:
\[ S.E. = \sqrt{\frac{p\times(1-p)}{n}} \]
Where p is the proportion and n is the number of observations.
Then you calculate the z-score and the upper and lower bounds of the confidence interval as you would for a continuous variable.
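A quick R sketch of that calculation (hypothetical values of p and n, similar to the pool variable in the sample):

```r
# 95% confidence interval for a proportion, normal approximation:
# margin = z * sqrt(p * (1 - p) / n)
p <- 0.14  # e.g., the sample proportion of houses with a pool
n <- 100
z <- qnorm(0.975)
margin <- z * sqrt(p * (1 - p) / n)
c(lower = p - margin, upper = p + margin)
```

These bounds will differ a little from the prop.test() results shown earlier, because prop.test() defaults to the Wilson score interval with a continuity correction rather than this simple normal approximation.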
Calculating regression coefficients
We can calculate regression coefficients for our predictor variables based on sample data, with the hope that this will inform our understanding of the way these variables relate to our outcome for the full population.
R
In R, you can estimate a linear regression model using the
lm() function.
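The call might look like this (a sketch; the formula uses the dummy variables blue and green rather than the color factor, matching the coefficients shown below):

```r
# Regress rent on the predictors; red is the omitted baseline color,
# so blue and green enter as dummy variables.
model <- lm(rent ~ sq_feet + dt_dist + pool + blue + green,
            data = sample_1)
```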
Here is how you can view the model coefficients using the coef() function:
## (Intercept) sq_feet dt_dist pool blue green
## 58.6207468 0.7271854 -55.1680183 195.7650168 112.6074978 134.4571913
Excel
Here is how you would generate regression coefficients and their associated standard errors (a standard error is like a standard deviation for an estimate: it describes how much the estimate would vary across repeated samples) in Excel.
First, select a block of cells with two rows and one more column than the number of predictor variables you want in your model.
With all those cells selected, type the formula
=LINEST() with four arguments: the outcome variable, the
set of predictor variables, TRUE (to indicate that you also need to
estimate the constant), and TRUE (to indicate that you also need
standard errors).
Once you’ve entered the formula, hold down the control and shift keys (or the command and shift keys on a Mac) while you hit Enter (or Return on a Mac).
The cells you’ve selected will populate with values. The first row will be estimated coefficients and the second row will be standard errors.
The last column will be the constant. The rest of the columns will be for the variables in your set of predictors, in the reverse order of how they appear in your dataset.